ABSTRACT

We investigate the problem of transforming an input sequence into a
high-dimensional output sequence in order to transcribe polyphonic
audio music into symbolic notation. We introduce a probabilistic
model based on a recurrent neural network that is able to learn
realistic output distributions given the input and we devise an
efficient algorithm to search for the global mode of that
distribution. The resulting method produces musically plausible
transcriptions even under high levels of noise and drastically
outperforms previous state-of-the-art approaches on five datasets of
synthesized sounds and real recordings, approximately halving the
test error rate.

Index Terms — Sequence transduction, restricted Boltzmann 
machine, recurrent neural network, polyphonic transcription 

1. INTRODUCTION 

Machine learning tasks can often be formulated as the transformation,
or transduction, of an input sequence into an output sequence: speech
recognition, machine translation, chord recognition or automatic
music transcription, for example. Recurrent neural networks (RNN) [1]
offer an interesting route for sequence transduction [2] because of
their ability to represent arbitrary output distributions involving
complex temporal dependencies at different time scales.

When the output predictions are high-dimensional vectors, such as
tuples of notes in musical scores, it becomes very expensive to
enumerate all possible configurations at each time step. One possible
approach is to capture high-order interactions between output
variables using restricted Boltzmann machines (RBM) [3] or a
tractable variant called NADE [4], a weight-sharing form of the
architecture introduced in [5]. In a recently developed probabilistic
model called the RNN-RBM, a series of distribution estimators (one at
each time step) are conditioned on the deterministic output of an
RNN [6, 7]. In this work, we introduce an input/output extension of
the RNN-RBM that can learn to map input sequences to output
sequences, whereas the original RNN-RBM only learns the output
sequence distribution. In contrast to the approach of [2] designed
for discrete output symbols, or one-hot vectors, our high-dimensional
paradigm requires a more elaborate inference procedure. Other
differences include our use of second-order Hessian-free (HF) [8]
optimization¹ but not of LSTM cells [9] and, for simplicity and
performance reasons, our use of a single recurrent network to perform
both transcription and temporal smoothing. We also do not need
special "null" symbols since the sequences are already aligned in our
main task of interest: polyphonic music transcription.

The objective of polyphonic transcription is to obtain the underlying
notes of a polyphonic audio signal as a symbolic piano-roll, i.e. as
a binary matrix specifying precisely which notes occur at each time
step. We will show that our transduction algorithm produces more
musically plausible transcriptions in both noisy and normal
conditions and achieves superior overall accuracy [10] compared to
existing methods. Our approach is also an improvement over the hybrid
method in [6] that combines symbolic and acoustic models by a product
of experts and a greedy chronological search, and over [11], which
operates in the time domain under Markovian assumptions. Finally,
[12] employs a bidirectional RNN without temporal smoothing and with
independent output note probabilities. Other tasks that can be
addressed by our transduction framework include automatic
accompaniment, melody harmonization and audio music denoising.

2. PROPOSED ARCHITECTURE 

2.1. Restricted Boltzmann machines 

An RBM is an energy-based model where the joint probability of a
given configuration of the visible vector $v \in \{0,1\}^N$ (output)
and the hidden vector $h$ is:

$$ P(v, h) = \exp(b_v^T v + b_h^T h + h^T W v) / Z \tag{1} $$

where $b_v$, $b_h$ and $W$ are the model parameters and $Z$ is the
usually intractable partition function. The marginalized probability
of $v$ is related to the free energy $F(v)$ by $P(v) = e^{-F(v)}/Z$:

$$ F(v) = -b_v^T v - \sum_i \log(1 + e^{b_h + W v})_i \tag{2} $$

The gradient of the negative log-likelihood of an observed vector $v$
involves two opposing terms, called the positive and negative phase:

$$ \frac{\partial(-\log P(v))}{\partial\Theta} = \frac{\partial F(v)}{\partial\Theta} - \frac{\partial(-\log Z)}{\partial\Theta} \tag{3} $$

where $\Theta \equiv \{b_v, b_h, W\}$. The second term can be
estimated by a single sample $v^*$ obtained from a Gibbs chain
starting at $v$:

$$ \frac{\partial(-\log P(v))}{\partial\Theta} \simeq \frac{\partial F(v)}{\partial\Theta} - \frac{\partial F(v^*)}{\partial\Theta} \tag{4} $$

resulting in the well-known contrastive divergence algorithm [13].
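
As a sanity check on the relation between the joint distribution and the free energy, the following minimal Python sketch enumerates a tiny RBM by brute force and compares the sum of the unnormalized joint over $h$ with $e^{-F(v)}$. It assumes the sign convention under which (2) holds, namely an unnormalized joint $\exp(b_v^T v + b_h^T h + h^T W v)$; the parameter values, sizes and function names are illustrative and not taken from the paper.

```python
import itertools
import math

def free_energy(v, b_v, b_h, W):
    """F(v) = -b_v.v - sum_i log(1 + exp(b_h + W v)_i), as in eq. (2)."""
    act = [b_h[i] + sum(W[i][j] * v[j] for j in range(len(v)))
           for i in range(len(b_h))]
    return (-sum(bj * vj for bj, vj in zip(b_v, v))
            - sum(math.log1p(math.exp(a)) for a in act))

def unnormalized_joint(v, h, b_v, b_h, W):
    """exp(b_v.v + b_h.h + h^T W v), i.e. Z * P(v, h) under the assumed signs."""
    s = sum(bj * vj for bj, vj in zip(b_v, v))
    s += sum(bi * hi for bi, hi in zip(b_h, h))
    s += sum(h[i] * W[i][j] * v[j]
             for i in range(len(h)) for j in range(len(v)))
    return math.exp(s)

# Hypothetical toy parameters: 3 visible units, 2 hidden units.
b_v = [0.1, -0.4, 0.3]
b_h = [0.2, -0.1]
W = [[0.5, -0.3, 0.8],
     [-0.6, 0.4, 0.1]]

def marginal_gap(v):
    """|sum_h Z*P(v, h) - exp(-F(v))|, expected to vanish up to rounding."""
    total = sum(unnormalized_joint(v, h, b_v, b_h, W)
                for h in itertools.product([0, 1], repeat=len(b_h)))
    return abs(total - math.exp(-free_energy(v, b_v, b_h, W)))

max_gap = max(marginal_gap(v)
              for v in itertools.product([0, 1], repeat=len(b_v)))
```

The brute-force sum is only feasible for a handful of hidden units; the point of the closed-form free energy is that it avoids this enumeration over $h$, while $Z$ itself remains intractable in general.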

2.2. NADE 

The neural autoregressive distribution estimator (NADE) [4] is a
tractable model inspired by the RBM. NADE is similar to a fully
visible sigmoid belief network in that the conditional probability
distribution of a visible unit $v_j$ is expressed as a nonlinear
function of the vector $v_{<j} \equiv \{v_k, \forall k < j\}$:

$$ P(v_j = 1 \mid v_{<j}) = \sigma(W_{:,j}^T h_j + (b_v)_j) \tag{5} $$

$$ h_j = \sigma(W_{:,<j} v_{<j} + b_h) \tag{6} $$

where $\sigma(x) \equiv (1 + e^{-x})^{-1}$ is the logistic sigmoid
function.

Fig. 1. Graphical structure of the I/O RNN-RBM. Single arrows
represent a deterministic function, double arrows represent the
hidden-visible connections of an RBM, dotted arrows represent
optional connections for temporal smoothing. The $x \to \{v, h\}$
connections have been omitted for clarity at each time step except
the last.

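In the following discussion, one can substitute RBMs with NADEs by
replacing equation (4) with the exact gradient of the negative
log-likelihood cost $C \equiv -\log P(v)$:

$$ \frac{\partial C}{\partial (b_v)_j} = P(v_j = 1 \mid v_{<j}) - v_j \tag{7} $$

$$ \frac{\partial C}{\partial b_h} = \sum_{k=1}^{N} \frac{\partial C}{\partial (b_v)_k} W_{:,k} h_k (1 - h_k) \tag{8} $$

$$ \frac{\partial C}{\partial W_{:,j}} = \frac{\partial C}{\partial (b_v)_j} h_j + v_j \sum_{k=j+1}^{N} \frac{\partial C}{\partial (b_v)_k} W_{:,k} h_k (1 - h_k) \tag{9} $$

where the products of vectors in (8) and (9) are taken element-wise.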

In addition to the possibility of using HF for training, a tractable 
distribution estimator is necessary to compare the probabilities of 
different output sequences during inference. 
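
To illustrate why NADE is tractable, here is a minimal Python sketch that evaluates $\log P(v)$ as a sum of the conditionals (5)-(6) and checks by brute force that the probabilities of all $2^N$ configurations sum to one. The parameter values and the function name `nade_log_prob` are illustrative assumptions; $W$ is stored as a list of rows (hidden units) by columns (visible units).

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nade_log_prob(v, b_v, b_h, W):
    """log P(v) = sum_j log P(v_j | v_<j), following eqs. (5)-(6)."""
    n_hidden, n_visible = len(b_h), len(b_v)
    pre = list(b_h)  # running value of W[:, <j] v_<j + b_h
    log_p = 0.0
    for j in range(n_visible):
        h_j = [sigmoid(a) for a in pre]                      # eq. (6)
        a_j = sum(W[i][j] * h_j[i] for i in range(n_hidden)) + b_v[j]
        p_j = sigmoid(a_j)                                   # eq. (5)
        log_p += math.log(p_j if v[j] else 1.0 - p_j)
        if v[j]:
            # Unit j becomes part of the context for units j+1, ..., N.
            for i in range(n_hidden):
                pre[i] += W[i][j]
    return log_p

# Hypothetical toy parameters: 3 visible units, 2 hidden units.
b_v = [0.1, -0.4, 0.3]
b_h = [0.2, -0.1]
W = [[0.5, -0.3, 0.8],
     [-0.6, 0.4, 0.1]]

total = sum(math.exp(nade_log_prob(v, b_v, b_h, W))
            for v in itertools.product([0, 1], repeat=len(b_v)))
```

Because each conditional is normalized on its own, no partition function appears, which is what allows the likelihoods of different candidate outputs to be compared directly.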

2.3. The input/output RNN-RBM 

The I/O RNN-RBM is a sequence of conditional RBMs (one at each time
step) whose parameters $b_v^{(t)}$, $b_h^{(t)}$, $W^{(t)}$ are
time-dependent and depend on the sequence history at time $t$,
denoted $\mathcal{A}^{(t)} \equiv \{x^{(\tau)}, v^{(\tau)} \mid \tau < t\}$,
where $\{x^{(t)}\}$, $\{v^{(t)}\}$ are respectively the input and
output sequences. Its graphical structure is depicted in Figure 1.
Note that by ignoring the input $x$, this model would reduce to the
RNN-RBM [6]. The I/O RNN-RBM is formally defined by its joint
probability distribution:

$$ P(\{v^{(t)}\}) = \prod_{t=1}^{T} P(v^{(t)} \mid \mathcal{A}^{(t)}) \tag{10} $$

where the right-hand side multiplicand is the marginalized
probability of the $t$-th RBM (eq. 2) or NADE (eq. 5).

Following our previous work, we will consider the case where only
the biases are variable:

$$ b_h^{(t)} = b_h + W_{\hat{h}h} \hat{h}^{(t-1)} + W_{xh} x^{(t)} \tag{11} $$

$$ b_v^{(t)} = b_v + W_{\hat{h}v} \hat{h}^{(t-1)} + W_{xv} x^{(t)} \tag{12} $$

where $\hat{h}^{(t)}$ are the hidden units of a single-layer RNN:

$$ \hat{h}^{(t)} = \sigma(W_{v\hat{h}} v^{(t)} + W_{\hat{h}\hat{h}} \hat{h}^{(t-1)} + W_{x\hat{h}} x^{(t)} + b_{\hat{h}}) \tag{13} $$

where the indices of weight matrices and bias vectors have obvious
meanings. The special case $W_{v\hat{h}} = 0$ gives rise to a
transcription network without temporal smoothing. Gradient
evaluation is based on the following general scheme:

1. Propagate the current values of the hidden units in the RNN 
portion of the graph using (13), 

2. Calculate the RBM or NADE parameters that depend on $\hat{h}^{(t)}$
(eq. 11-12) and obtain the log-likelihood gradient with respect to
$W$, $b_v^{(t)}$ and $b_h^{(t)}$ (eq. 4 or eq. 7-9),

3. Propagate the estimated gradient with respect to $b_v^{(t)}$,
$b_h^{(t)}$ backward through time (BPTT) [1] to obtain the estimated
gradient with respect to the RNN parameters.
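
The forward part of this scheme (steps 1 and 2, before any gradient is taken) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the dictionary keys, the toy dimensions and the zero initial state `h_hat0` are hypothetical choices, and matrices are stored as lists of rows.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def sigmoid_vec(a):
    return [1.0 / (1.0 + math.exp(-ai)) for ai in a]

def conditional_biases(xs, vs, p, h_hat0):
    """Return [(b_v^(t), b_h^(t))] for t = 1..T following eqs. (11)-(13)."""
    h_hat, out = h_hat0, []
    for x_t, v_t in zip(xs, vs):
        b_h_t = add(p["b_h"], matvec(p["W_hhat_h"], h_hat),
                    matvec(p["W_x_h"], x_t))               # eq. (11)
        b_v_t = add(p["b_v"], matvec(p["W_hhat_v"], h_hat),
                    matvec(p["W_x_v"], x_t))               # eq. (12)
        out.append((b_v_t, b_h_t))
        h_hat = sigmoid_vec(add(matvec(p["W_v_hhat"], v_t),
                                matvec(p["W_hhat_hhat"], h_hat),
                                matvec(p["W_x_hhat"], x_t),
                                p["b_hhat"]))              # eq. (13)
    return out

# Hypothetical toy model: 2 inputs, 3 outputs, 2 RBM hidden, 2 RNN hidden units.
zeros22 = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "b_v": [0.0, 0.0, 0.0], "b_h": [0.0, 0.0], "b_hhat": [0.0, 0.0],
    "W_hhat_h": [[1.0, 0.0], [0.0, 1.0]],
    "W_hhat_v": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "W_x_h": zeros22,
    "W_x_v": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    "W_v_hhat": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # zero: no temporal smoothing
    "W_hhat_hhat": zeros22,
    "W_x_hhat": zeros22,
}
xs = [[1.0, 2.0], [3.0, 4.0]]
vs = [[1, 0, 1], [0, 1, 0]]
biases = conditional_biases(xs, vs, params, h_hat0=[0.0, 0.0])
```

In this toy setting the RNN weights are all zero, so the RNN state equals 0.5 after the first step and the second-step biases are shifted accordingly; a trained model would instead carry information about past inputs and outputs through the RNN state.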

By setting $W = 0$, the I/O RNN-RBM reduces to a regular RNN that can
be trained with the cross-entropy cost:

$$ L(\{v^{(t)}\}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{N} -v_j^{(t)} \log p_j^{(t)} - (1 - v_j^{(t)}) \log(1 - p_j^{(t)}) \tag{14} $$

where $p^{(t)} = \sigma(b_v^{(t)})$ and equations (12) and (13) hold.
We will use this model as one of our baselines for comparison.
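
A minimal Python sketch of this cost is given below, assuming the per-note Bernoulli cross-entropy is summed over the $N$ notes and averaged over the $T$ time steps (the $1/T$ normalization is an assumption of this sketch). The function name and the toy values are illustrative.

```python
import math

def cross_entropy(v_seq, p_seq):
    """Cross-entropy between binary targets v_seq[t][j] and predicted
    note probabilities p_seq[t][j] (each strictly between 0 and 1)."""
    total = 0.0
    for v_t, p_t in zip(v_seq, p_seq):
        for v, p in zip(v_t, p_t):
            total += -v * math.log(p) - (1 - v) * math.log(1.0 - p)
    return total / len(v_seq)

# Toy piano-roll with T = 2 frames and N = 2 notes.
targets = [[1, 0], [0, 1]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
confident = [[0.9, 0.1], [0.2, 0.8]]

loss_uninformative = cross_entropy(targets, uninformative)
loss_confident = cross_entropy(targets, confident)
```

Uninformative predictions of 0.5 cost $\log 2$ per note, so any prediction that leans toward the correct label lowers the loss.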

A potential difficulty with this training scenario stems from the
fact that since $v$ is known during training, the model might
(understandably) assign more weight to the symbolic information than
the acoustic information. This form of teacher forcing during
training could have dangerous consequences at test time, where the
model is autonomous and may not be able to recover from past
mistakes. The extent of this condition obviously depends on the
ambiguousness of the audio and the intrinsic predictability of the
output sequences, and can also be controlled by introducing noise to
either $x^{(t)}$ or $v^{(\tau)}$, $\tau < t$, or by adding the
regularization terms
$\alpha(|W_{xv}|^2 + |W_{xh}|^2) + \beta(|W_{\hat{h}v}|^2 + |W_{\hat{h}h}|^2)$
to the objective function. It is trivial to revise the stochastic
gradient descent updates to take those penalties into account.
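
For concreteness, assuming the norms are Frobenius norms and a plain stochastic gradient step with learning rate $\eta$ (both assumptions made here for illustration), any matrix $W_\bullet$ penalized with coefficient $\lambda \in \{\alpha, \beta\}$ simply picks up a weight-decay term:

```latex
\frac{\partial}{\partial W_\bullet}\left( C + \lambda |W_\bullet|^2 \right)
  = \frac{\partial C}{\partial W_\bullet} + 2 \lambda W_\bullet
\quad\Longrightarrow\quad
W_\bullet \leftarrow W_\bullet - \eta \left( \frac{\partial C}{\partial W_\bullet} + 2 \lambda W_\bullet \right)
```

The unpenalized parameters keep their usual update, so the two coefficients only change the relative shrinkage applied to each group of penalized matrices.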

3. INFERENCE 

A distinctive feature of our architecture is its (optional)
connections $v \to \hat{h}$ that implicitly tie $v^{(t)}$ to its
history $\mathcal{A}^{(t)}$ and encourage coherence between
successive output frames, and temporal smoothing in particular. At
test time, predicting one time step $v^{(t)}$ requires the knowledge
of the previous decisions on $v^{(\tau)}$ (for $\tau < t$) which are
yet uncertain (not chosen optimally), and proceeding in a greedy
chronological manner does not necessarily yield configurations that
maximize the likelihood of the complete sequence². We rather favor a
global search approach analogous to the Viterbi algorithm for
discrete-state HMMs. Since in the general case the partition function
of the $t$-th RBM depends on $\mathcal{A}^{(t)}$, comparing sequence
likelihoods becomes intractable, hence our use of the tractable NADE.

² Note that without temporal smoothing ($W_{v\hat{h}} = 0$), the
$v^{(t)}$, $1 \le t \le T$ would be conditionally independent given
$x$ and the prediction could simply be obtained separately at each
time step $t$.

Our algorithm is a variant of beam search for high-dimensional
sequences, with beam width $w$ and maximal branching factor $K$
(Algorithm 1). Beam search is a breadth-first tree search where only
the $w$ most promising paths (or nodes) at depth $t$ are kept for
future examination. In our case, a node at depth $t$ corresponds to a
subsequence of length $t$, and all descendants of that node are
assumed to share the same sequence history $\mathcal{A}^{(t+1)}$;
consequently, only $v^{(t)}$ is allowed to change among siblings.
This structure facilitates identifying the most promising paths by
their cumulative log-likelihood. For high-dimensional output however,
any non-leaf node has exponentially many children ($2^N$), which in
practice limits the exploration to a fixed number $K$ of siblings.
This is necessary because enumerating the configurations at a given
time step by decreasing likelihood is intractable (e.g. for RBM or
NADE) and we must resort to stochastic search to form a pool of
promising children at each node. Stochastic search consists in
drawing $S$ samples of $v^{(t)} \mid \mathcal{A}^{(t)}$ and keeping
the $K$ unique most probable configurations. This procedure usually
converges rapidly with $S \simeq 10K$ samples, especially with strong
biases coming from the conditional terms. Note that $w = 1$ or
$K = 1$ reduces to a greedy search, and $w = 2^{NT}$, $K = 2^N$
corresponds to an exhaustive breadth-first search.

Algorithm 1 HIGH-DIMENSIONAL BEAM SEARCH

Find the most likely sequence $\{v^{(t)}, 1 \le t \le T\}$ under a
model m with beam width w and branching factor K.

 1: q ← min-priority queue
 2: q.insert(0, m)
 3: for t = 1 ... T do
 4:     q' ← min-priority queue of capacity w *
 5:     while l, m ← q.pop() do
 6:         for l', v' in m.find_most_probable(K) do
 7:             m' ← m with v^(t) := v'
 8:             q'.insert(l + l', m')
 9:     q ← q'
10: return q.pop()

* A min-priority queue of fixed capacity w maintains (at most) the w
highest values at all times.
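
The beam search described here can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the `candidates` callback stands in for the model's search for probable children, and the toy model `toy_candidates` (a hypothetical mix of a noisy observation and persistence of the previous frame) enumerates all $2^N$ frames exactly instead of using stochastic search, which is only possible because $N$ is tiny.

```python
import heapq
import itertools
import math

def beam_search(candidates, T, w):
    """Keep the w best prefixes (by cumulative log-likelihood) at each depth.

    candidates(prefix) returns (log_prob, v) pairs for the next frame given
    the tuple of frames chosen so far (at most K pairs).
    """
    beam = [(0.0, ())]
    for _ in range(T):
        expanded = [(l + dl, prefix + (v,))
                    for l, prefix in beam
                    for dl, v in candidates(prefix)]
        beam = heapq.nlargest(w, expanded, key=lambda item: item[0])
    return max(beam, key=lambda item: item[0])

def toy_candidates(obs, K):
    """Hypothetical model: note j is on with a probability that averages the
    observation obs[t][j] and a persistence term from the previous frame."""
    n = len(obs[0])
    def candidates(prefix):
        t = len(prefix)
        prev = prefix[-1] if prefix else (0,) * n
        scored = []
        for v in itertools.product([0, 1], repeat=n):
            lp = 0.0
            for j, vj in enumerate(v):
                p_on = 0.5 * obs[t][j] + 0.5 * (0.9 if prev[j] else 0.1)
                lp += math.log(p_on if vj else 1.0 - p_on)
            scored.append((lp, v))
        return heapq.nlargest(K, scored, key=lambda item: item[0])
    return candidates

obs = [(0.9, 0.2), (0.4, 0.6), (0.1, 0.8)]
best_ll, best_path = beam_search(toy_candidates(obs, K=4), T=len(obs), w=8)
```

Setting `w=1` gives the greedy chronological search, while a beam wide enough to hold every prefix makes the search exhaustive; intermediate widths trade accuracy for computation.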

When the output units $v_j^{(t)}$, $1 \le j \le N$ are conditionally
independent given $\mathcal{A}^{(t)}$, such as for a regular RNN
(eq. 14), it is possible to enumerate configurations by decreasing
likelihood using a dynamic programming approach (Algorithm 2). This
very efficient algorithm in $O(K \log K + N \log N)$ is based on
linearly growing priority queues, where $K$ need not be specified in
advance. Since inference is usually the bottleneck of the
computation, this optimization makes it possible to use much higher
beam widths $w$ with unbounded branching factors for RNNs.

Algorithm 2 INDEPENDENT OUTPUTS INFERENCE

Enumerate the K most probable configurations of N independent
Bernoulli random variables with parameters 0 < p_i < 1.

 1: v_0 ← {i : p_i > 1/2}
 2: l_0 ← Σ_i log(max(p_i, 1 - p_i))
 3: yield l_0, v_0
 4: L_i ← |log(p_i / (1 - p_i))|
 5: sort L, store corresponding permutation R
 6: q ← min-priority queue
 7: q.insert(L_0, {0})
 8: while l, v ← q.pop() do
 9:     yield l_0 - l, v_0 Δ R[v] *
10:     i ← max(v)
11:     if i + 1 < N then
12:         q.insert(l + L_{i+1}, v ∪ {i + 1})
13:         q.insert(l + L_{i+1} - L_i, v ∪ {i + 1} \ {i})

* A Δ B ≡ (A ∪ B) \ (A ∩ B) denotes the symmetric difference of two
sets. R[v] indicates the R-permutation of indices in the set v.
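
A possible Python rendering of Algorithm 2 as a lazy generator is sketched below. The function name, the use of `heapq`, and the representation of configurations as frozensets of active indices are choices made for this illustration.

```python
import heapq
import math

def enumerate_configurations(p):
    """Yield (log_likelihood, active_indices) by decreasing likelihood for
    independent Bernoulli variables with parameters 0 < p[i] < 1."""
    n = len(p)
    v0 = frozenset(i for i in range(n) if p[i] > 0.5)   # most likely configuration
    l0 = sum(math.log(max(pi, 1.0 - pi)) for pi in p)
    yield l0, v0
    if n == 0:
        return
    cost = [abs(math.log(pi / (1.0 - pi))) for pi in p]  # log-likelihood lost by flipping unit i
    order = sorted(range(n), key=lambda i: cost[i])      # permutation R
    L = [cost[i] for i in order]                         # flip costs in increasing order
    queue = [(L[0], (0,))]   # (total flip cost, increasing tuple of flipped ranks)
    while queue:
        l, v = heapq.heappop(queue)
        flipped = frozenset(order[r] for r in v)
        yield l0 - l, v0 ^ flipped                       # symmetric difference
        i = v[-1]                                        # largest flipped rank
        if i + 1 < n:
            # Either flip one more unit, or move the last flip one rank up.
            heapq.heappush(queue, (l + L[i + 1], v + (i + 1,)))
            heapq.heappush(queue, (l + L[i + 1] - L[i], v[:-1] + (i + 1,)))

p = [0.9, 0.3, 0.6, 0.45]
results = list(enumerate_configurations(p))
```

Since the generator is lazy, the K best configurations can be taken with `itertools.islice(enumerate_configurations(p), K)` without fixing K in advance; exhausting it, as done above, visits each of the $2^N$ configurations exactly once.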



A pathological condition that sometimes occurs with beam search over
long sequences ($T \gg 200$) is the exponential duplication of highly
likely quasi-identical paths differing only at a few time steps, that
quickly saturate beam width with essentially useless variations.
Several strategies have been tried with moderate success in those
cases, such as committing to the most likely path every $M$ time
steps (periodic restarts [14]), pruning similar paths, or pruning
paths with identical $\tau$ previous time steps (the local
assumption), where $\tau$ is a maximal time lag that the chosen
architecture can reasonably describe (e.g. $\tau \simeq 200$ for RNNs
trained with HF). It is also possible to initialize the search with
Algorithm 1 then backtrack at each node iteratively, resulting in an
anytime algorithm [15].

Dataset          HMM [16]    RNN-RBM [6]    Proposed
Piano-midi.de    59.5%       60.8%          64.1%
Nottingham       71.4%       77.1%          97.4%
MuseData         35.1%       44.7%          66.6%
JSB Chorales     72.0%       80.6%          91.7%

Table 1. Frame-level transcription accuracy obtained on four
datasets by the Nam et al. algorithm with HMM temporal smoothing
[16], using the RNN-RBM musical language model [6], or the proposed
I/O RNN-NADE model.

4. EXPERIMENTS 

In the following experiments, the acoustic input $x^{(t)}$ is
constituted of powerful DBN-based learned representations [16]. The
magnitude spectrogram is first computed by the short-term Fourier
transform using a 128 ms sliding Blackman window truncated at 6 kHz,
normalized and cube root compressed to reduce the dynamic range. We
apply PCA whitening to retain 99% of the training data variance,
yielding roughly 30-70% dimensionality reduction. A DBN is then
constructed by greedy layer-wise stacking of sparse RBMs trained in
an unsupervised way to model the previous hidden layer expectation
($v^{l+1} \equiv E[h^l \mid v^l]$) [17]. The whole network is finally
finetuned with respect to a supervised criterion (e.g. eq. 14) and
the last layer is then used as our input $x^{(t)}$ for the
spectrogram frame at time $t$.

We evaluate our method on five datasets of varying complexity: 
Piano-midi.de, Nottingham, MuseData and JSB chorales (see [6]) 
which are rendered from piano and orchestral instrument soundfonts, 
and Poliner & Ellis [18] that comprises synthesized sounds and real 
recordings. We use frame-level accuracy [10] for model evaluation. 
Hyperparameters are selected by a random search [19] on predefined 
intervals to optimize validation set accuracy; final performance is 
reported on the test set. 
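
The text does not spell out the metric of [10]. The sketch below assumes the common definition of frame-level accuracy for multi-pitch transcription, namely true positives divided by the sum of true positives, false positives and false negatives, accumulated over all frames and notes; this definition, the function name and the toy piano-rolls are assumptions of this illustration.

```python
def frame_level_accuracy(reference, predicted):
    """TP / (TP + FP + FN) over binary piano-rolls indexed as [frame][note]."""
    tp = fp = fn = 0
    for ref_frame, pred_frame in zip(reference, predicted):
        for r, p in zip(ref_frame, pred_frame):
            if r and p:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # two empty piano-rolls agree perfectly

reference = [[1, 0, 1], [0, 1, 0]]
predicted = [[1, 1, 0], [0, 1, 0]]
accuracy = frame_level_accuracy(reference, predicted)
```

Under this definition, correctly predicted silences do not raise the score, so a model cannot obtain a high accuracy by predicting mostly empty frames.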

Table 1 compares the performance of the I/O RNN-RBM to the HMM
baseline [16] and the RNN-RBM hybrid approach [6] on four datasets.
Contrary to the product of experts of [6], our model is jointly
trained, which eliminates duplicate contributions to the energy
function and the related increase in marginals temperature, and
provides much better performance on all datasets, approximately
halving the error rate on average over these datasets.
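
The size of this reduction can be checked directly from the accuracies reported in Table 1. The short Python sketch below takes the error rate as 100% minus the accuracy and averages it over the four datasets; treating the unweighted mean as the relevant average is an assumption of this illustration.

```python
# Accuracies (%) from Table 1: (HMM [16], RNN-RBM [6], Proposed).
accuracy = {
    "Piano-midi.de": (59.5, 60.8, 64.1),
    "Nottingham":    (71.4, 77.1, 97.4),
    "MuseData":      (35.1, 44.7, 66.6),
    "JSB Chorales":  (72.0, 80.6, 91.7),
}

def mean_error(column):
    """Unweighted mean over datasets of (100 - accuracy) for one method."""
    return sum(100.0 - row[column] for row in accuracy.values()) / len(accuracy)

hmm_err, rnnrbm_err, proposed_err = (mean_error(c) for c in range(3))
ratio_vs_hmm = proposed_err / hmm_err
ratio_vs_rnnrbm = proposed_err / rnnrbm_err
```

With these numbers the mean error drops from 40.5% (HMM) and 34.2% (RNN-RBM) to about 20.05%, i.e. to roughly 0.50 and 0.59 of the respective baselines.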

We now assess the robustness of our algorithm to different types of
noise: white noise, pink noise, masking noise and spectral
distortion. In masking noise, parts of the signal of exponentially
distributed length ($\mu = 0.4$ s) are randomly destroyed [20];
spectral distortion consists in Gaussian pitch shifts of amplitude
$\sigma$ [21]. The first two types are simplest because a network can
recover from them by averaging neighboring spectrogram frames (e.g.
Kalman smoothing), whereas the last two time-coherent types require
higher-level musical understanding. We compare a bidirectional RNN
[12] adapted for frame-level transcription, a regular RNN with
$v \to \hat{h}$ connections ($w = 2000$) and the I/O RNN-NADE
($w = 50$, $K = 10$). Figure 2 illustrates the importance of temporal
smoothing connections and the additional advantage provided by
conditional distribution estimators. Beam search is responsible for a
0.5% to 18% increase in accuracy over a greedy search ($w = 1$).

Fig. 2. Robustness to different types of noise of various RNN-based
models on the JSB chorales dataset.

Method                     Accuracy
SONIC [22]                 39.6%
Note events + HMM [23]     46.6%
Linear SVM [18]            67.7%
DBN + SVM [16]             72.5%
BLSTM RNN [12]             75.2%
AdaBoost cascade [24]      75.2%
I/O RNN-NADE               79.1%

Table 2. Frame-level accuracy of existing transcription methods on
the Poliner & Ellis dataset [18].

Figure 3 shows transcribed piano-rolls for various RNNs on an excerpt
of Bach's chorale Es ist genug with 6 dB pink noise (Fig. 3(a)). We
observe that a bidirectional RNN is unable to perform temporal
smoothing on its own (Fig. 3(b)), and that even a post-processed
version (Fig. 3(c)) can be improved by our global search algorithm
(Fig. 3(d)). Our best model offers an even more musically plausible
transcription (Fig. 3(e)). Finally, we compare the transcription
accuracy of common methods on the Poliner & Ellis [18] dataset in
Table 2, where the I/O RNN-NADE reaches the highest accuracy (79.1%).

Fig. 3. Demonstration of temporal smoothing on an excerpt of Bach's
chorale Es ist genug (BWV 60.5) with 6 dB pink noise. Figure shows
(a) the raw magnitude spectrogram, and transcriptions by (b) a
bidirectional RNN, (c) a bidirectional RNN with HMM post-processing,
(d) an RNN with $v \to \hat{h}$ connections ($w = 75$) and (e)
I/O RNN-NADE ($w = 20$, $K = 10$). Predicted piano-rolls (black) are
interleaved with the ground-truth (white) for comparison.



5. CONCLUSIONS 

We presented an input/output model for high-dimensional sequence
transduction in the context of polyphonic music transcription. Our
model can learn basic musical properties such as temporal continuity,
harmony and rhythm, and efficiently search for the most musically
plausible transcriptions when the audio signal is partially
destroyed, distorted or temporarily inaudible. Conditional
distribution estimators are important in this context to accurately
describe the density of multiple potential paths given the weakly
discriminative audio. This ability translates well to the
transcription of "clean" signals where instruments may still be
buried and notes occluded due to interference, ambient noise or
imperfect recording techniques. Our algorithm approximately halves
the error rate with respect to competing methods on five polyphonic
datasets based on frame-level accuracy. Qualitative testing also
suggests that a more musically relevant metric would enhance the
advantage of our model, since transcription errors often constitute
reasonable alternatives.